Universal Kriging (UK)#
Universal Kriging (UK) is a variant of Ordinary Kriging for non-stationary conditions, where the mean differs in a deterministic way from location to location (a local trend, or drift) while the variance remains constant. This second-order ("weak") stationarity is often a pertinent assumption for environmental exposures. In UK, a trend is usually first fitted as a function of the coordinates, and the variation in what is left over (the residuals) is then modelled as a random field and added back to the trend to make the final prediction.
\( Z(\vec{x}) = \sum_{k} \beta_{k}\, f_{k}(\vec{x}) + \varepsilon(\vec{x}) \)

where the \(f_{k}\) are global functions of position \(\vec{x}\) and the \(\beta_{k}\) are the coefficients. The \(f_{k}\) are called base functions. The \(\varepsilon(\vec{x})\) is the spatially correlated error, which is modelled as before with a variogram, but now considering only the residuals left after the global trend is removed.
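The decomposition above (deterministic trend plus kriged residuals) can be sketched in plain NumPy. This is a minimal illustration on synthetic data, not the PM2.5 observations: a first-order linear drift is fitted by least squares, and the residuals are what the variogram would then be fitted to and kriged as a zero-mean random field.

```python
import numpy as np

# Hypothetical station coordinates (meters) and observations with a
# linear east-west trend plus noise (synthetic, for illustration only).
rng = np.random.default_rng(42)
x = rng.uniform(0, 100_000, 50)
y = rng.uniform(0, 100_000, 50)
z = 2.0 + 1e-4 * x + rng.normal(0, 0.5, 50)

# Base functions f_k(x): an intercept and a first-order drift in x and y.
F = np.column_stack([np.ones_like(x), x, y])

# Least-squares estimate of the drift coefficients beta_k.
beta, *_ = np.linalg.lstsq(F, z, rcond=None)

# Residuals eps(x) = z - sum_k beta_k f_k(x); in UK the variogram is
# fitted to these residuals, not to the raw observations.
residuals = z - F @ beta

print(beta.round(6))
print(round(float(residuals.mean()), 6))  # ~0 by construction
```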
Note
The definition above comes from a geospatial data science course created by Prof. Zia Ahmed at the University at Buffalo, The State University of New York.
Thanks Prof. Zia Ahmed for the great resource!
Load python modules
[1]:
import context
import salem
import numpy as np
import pandas as pd
import xarray as xr
import geopandas as gpd
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
from pykrige.uk import UniversalKriging
import plotly.express as px
from datetime import datetime
from utils.utils import pixel2poly, plotvariogram, cfcompliant
from context import data_dir
******************************
context imported. Front of path:
/Users/rodell/krige-smoke
/Users/rodell/krige-smoke/docs/source
******************************
through /Users/rodell/krige-smoke/docs/source/context.py -- pha
Load Data#
Open the reformatted data in the meter-based Lambert projection (EPSG:3347). Again, this is helpful because lat/lon coordinates are less suitable for measuring distances, which is important for spatial interpolation.
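To see why projected meters are preferable to raw degrees, a small haversine sketch (with hypothetical points, not the monitor data) shows that one degree of longitude at these latitudes spans far fewer meters than one degree of latitude, so Euclidean distances computed in degrees are distorted:

```python
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    R = 6_371_000.0  # mean Earth radius, meters
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp = p2 - p1
    dl = np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# One degree of latitude vs one degree of longitude at 55 N:
d_lat = haversine_m(55.0, -120.0, 56.0, -120.0)  # ~111 km
d_lon = haversine_m(55.0, -120.0, 55.0, -119.0)  # ~64 km
print(d_lat / 1000, d_lon / 1000)
```

Both separations are "one degree", yet they differ by almost a factor of two in meters, which is why a linear, meter-based projection is used before kriging.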
[2]:
df = pd.read_csv(str(data_dir) + "/obs/gpm25.csv")
gpm25 = gpd.GeoDataFrame(
df,
crs="EPSG:4326",
geometry=gpd.points_from_xy(df["lon"], df["lat"]),
).to_crs("EPSG:3347")
gpm25["Easting"], gpm25["Northing"] = gpm25.geometry.x, gpm25.geometry.y
gpm25.head()
[2]:
| | Unnamed: 0 | id | datetime | lat | lon | PM2.5 | geometry | Easting | Northing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 42073 | 2021-07-16 22:00:00 | 47.185173 | -122.176855 | 0.862 | POINT (3972238.901 1767531.888) | 3.972239e+06 | 1.767532e+06 |
| 1 | 3 | 53069 | 2021-07-16 22:00:00 | 47.190197 | -122.177992 | 1.078 | POINT (3972419.863 1768071.699) | 3.972420e+06 | 1.768072e+06 |
| 2 | 12 | 10808 | 2021-07-16 22:00:00 | 40.507316 | -111.899188 | 9.780 | POINT (4460286.051 743178.640) | 4.460286e+06 | 7.431786e+05 |
| 3 | 16 | 85391 | 2021-07-16 22:00:00 | 48.454213 | -123.283643 | 0.874 | POINT (3964698.001 1931774.531) | 3.964698e+06 | 1.931775e+06 |
| 4 | 21 | 79095 | 2021-07-16 22:00:00 | 47.672130 | -122.514183 | 0.784 | POINT (3974631.739 1827689.201) | 3.974632e+06 | 1.827689e+06 |
Create Grid#
Again we will create a grid that we want to use for the interpolation.
Having the grid in the format of a dataset is helpful for reprojecting our covariates to match the interpolated grid.
[3]:
## define the desired grid resolution in meters
resolution = 20_000 # grid cell size in meters
## make grid based on dataset bounds and resolution
gridx = np.arange(
gpm25.bounds.minx.min() - resolution,
gpm25.bounds.maxx.max() + resolution,
resolution,
)
gridy = np.arange(
gpm25.bounds.miny.min() - resolution,
gpm25.bounds.maxy.max() + resolution,
resolution,
)
## use salem to create a dataset with the grid.
krig_ds = salem.Grid(
nxny=(len(gridx), len(gridy)),
dxdy=(resolution, resolution),
x0y0=(gpm25.bounds.minx.min(), gpm25.bounds.miny.min()),
proj="epsg:3347",
pixel_ref="corner",
).to_dataset()
## print dataset
krig_ds
[3]:
<xarray.Dataset>
Dimensions: (x: 147, y: 124)
Coordinates:
* x (x) float64 3.46e+06 3.48e+06 3.5e+06 ... 6.36e+06 6.38e+06
* y (y) float64 4.984e+05 5.184e+05 5.384e+05 ... 2.938e+06 2.958e+06
Data variables:
*empty*
Attributes:
pyproj_srs: +proj=lcc +lat_0=63.390675 +lon_0=-91.8666666666667 +lat_1=4...
Covariate#
We will use the BlueSky Canada Smoke Forecast (BlueSky) as a covariate for universal kriging with a specified drift. The data comes from firesmoke.ca
[4]:
ds = salem.open_xr_dataset(str(data_dir) + "/dispersion1.nc")
Set up specified drift#
For the specified drift, we need the BlueSky-modelled PM2.5 concentration at every air quality (AQ) monitor location, and the BlueSky-modelled PM2.5 concentration on the same grid we are interpolating to.
BlueSky modeled PM2.5 concentration at AQs location#
[5]:
y = xr.DataArray(
np.array(df["lat"]),
dims="ids",
coords=dict(ids=df.id.values),
)
x = xr.DataArray(
np.array(df["lon"]),
dims="ids",
coords=dict(ids=df.id.values),
)
var_points = ds["pm25"].interp(x=x, y=y, method="linear")
# print(var_points)
if len(df.index) == len(var_points.values):
var_points = var_points.values
else:
    raise ValueError("Lengths don't match")
BlueSky PM2.5 Data on grid#
Now we will transform the BlueSky PM2.5 data onto the grid we are interpolating to. This is fed in as the specified drift array when executing the interpolation.
[6]:
ds_T = krig_ds.salem.transform(ds)
var_array = ds_T["pm25"].values
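Conceptually, this regridding maps the source field onto the target grid coordinates (salem also handles the projection change). A minimal nearest-neighbour sketch with hypothetical small arrays, not the BlueSky data:

```python
import numpy as np

# Hypothetical coarse source grid and field (stand-in for the BlueSky grid).
src_x = np.linspace(0, 100, 11)
src_y = np.linspace(0, 100, 9)
src = np.add.outer(src_y, src_x)  # shape (9, 11); value = y + x

# Finer target grid we are interpolating onto (stand-in for krig_ds).
tgt_x = np.linspace(0, 100, 21)
tgt_y = np.linspace(0, 100, 17)

# Nearest-neighbour regrid: for each target coordinate, index the
# closest source coordinate. salem.transform does a proper reprojection
# and interpolation, but the array-alignment idea is the same.
ix = np.abs(src_x[None, :] - tgt_x[:, None]).argmin(axis=1)
iy = np.abs(src_y[None, :] - tgt_y[:, None]).argmin(axis=1)
regridded = src[np.ix_(iy, ix)]
print(regridded.shape)  # matches the target grid: (17, 21)
```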
Plot BlueSky PM2.5#
[7]:
ax = plt.axes(projection=ccrs.Orthographic(-80, 35))
ax.set_global()
ds["pm25"].plot(
ax=ax,
transform=ccrs.PlateCarree(),
levels=[0, 5, 10, 20, 40, 80, 160, 300, 600],
cmap="Reds",
)
ax.coastlines()
ax.set_extent([-132, -85, 35, 65], crs=ccrs.PlateCarree())
Setup UK#
[8]:
nlags = 15
variogram_model = "spherical"
startTime = datetime.now()
krig = UniversalKriging(
x=gpm25["Easting"], ## x location of aq monitors in lambert conformal
y=gpm25["Northing"], ## y location of aq monitors in lambert conformal
z=gpm25["PM2.5"], ## measured PM 2.5 concentrations at locations
drift_terms=["specified"],
variogram_model=variogram_model,
nlags=nlags,
specified_drift=[var_points], ## BlueSky PM2.5 at aq monitors
)
print(f"UK build time {datetime.now() - startTime}")
UK build time 0:03:30.516893
PyKrige will optimize most parameters based on the user-defined empirical model and the number of bins.
I tested several empirical models and bin sizes and found (for this case study) that a spherical model with 15 bins was optimal based on the output statistics.
The literature supports the spherical model for geospatial interpolation applications over other methods.
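As a rough illustration of what `nlags` controls, here is a minimal empirical-semivariogram sketch on synthetic 1-D data (not the PM2.5 observations): pairwise semivariances are grouped into `nlags` equal-width distance bins and averaged, and the resulting curve is what a model such as the spherical one is then fitted to.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D locations and values (synthetic, for illustration).
x = rng.uniform(0, 100, 80)
z = np.sin(x / 15) + rng.normal(0, 0.1, 80)

# All pairwise separation distances and semivariances 0.5*(z_i - z_j)^2.
dx = np.abs(x[:, None] - x[None, :])
gamma = 0.5 * (z[:, None] - z[None, :]) ** 2
iu = np.triu_indices_from(dx, k=1)
d, g = dx[iu], gamma[iu]

# Bin into nlags equal-width lag classes (what `nlags=15` controls in
# PyKrige) and average the semivariance within each bin.
nlags = 15
edges = np.linspace(0, d.max(), nlags + 1)
idx = np.clip(np.digitize(d, edges) - 1, 0, nlags - 1)
lags = np.array([d[idx == k].mean() for k in range(nlags)])
semivar = np.array([g[idx == k].mean() for k in range(nlags)])
print(lags.shape, semivar.shape)
```

Fewer bins smooth the empirical curve but hide short-range structure; more bins resolve it but leave fewer point pairs per bin, making each estimate noisier.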
[9]:
plotvariogram(krig)
Execute UK#
Interpolate data to our grid using UK with specified drift, where the specified drift is the BlueSky PM2.5 evaluated at all monitor locations and on the interpolated grid for kriging.
[10]:
## cap grid drift values above the monitor maximum to limit extrapolation of the drift
var_array[var_array > np.max(var_points)] = np.max(var_points) + 20
startTime = datetime.now()
z, ss = krig.execute("grid", gridx, gridy, specified_drift_arrays=[var_array])
print(f"UK execution time {datetime.now() - startTime}")
UK_pm25 = np.where(z < 0, 0, z)
krig_ds["UK_pm25"] = (("y", "x"), UK_pm25)
UK execution time 0:00:07.096796
Plot UK#
Convert the data to polygons so it can be plotted on a slippy mapbox. This is not necessary, but makes for a nice interactive map :)
[11]:
polygons, values = pixel2poly(gridx, gridy, UK_pm25, resolution)
pm25_model = gpd.GeoDataFrame(
{"Modelled PM2.5": values}, geometry=polygons, crs="EPSG:3347"
).to_crs("EPSG:4326")
fig = px.choropleth_mapbox(
pm25_model,
geojson=pm25_model.geometry,
locations=pm25_model.index,
color="Modelled PM2.5",
color_continuous_scale="jet",
center={"lat": 50.0, "lon": -110.0},
zoom=2.5,
mapbox_style="carto-positron",
opacity=0.6,
)
fig.update_layout(margin=dict(l=0, r=0, t=30, b=10))
fig.update_traces(marker_line_width=0)